Here you introduce the data sources, the intent of the EDA, etc. The data source is https://www.kaggle.com/nickhould/craft-cans/data. data is in two tables, beers and breweries that contains information about different craft beers and craft breweries, respectively. Check codebook.xlsx for more info about the columns.
library(tidyverse)
library(gridExtra)
beers<-read.csv("craft-cans/beers.csv")
breweries<-read.csv("craft-cans/breweries.csv")
Here you explore data structure and clean data if necessary. Note data formats, types and meaning
summary(beers)
## X abv ibu id
## Min. : 0.0 Min. :0.00100 Min. : 4.00 Min. : 1.0
## 1st Qu.: 602.2 1st Qu.:0.05000 1st Qu.: 21.00 1st Qu.: 808.2
## Median :1204.5 Median :0.05600 Median : 35.00 Median :1453.5
## Mean :1204.5 Mean :0.05977 Mean : 42.71 Mean :1431.1
## 3rd Qu.:1806.8 3rd Qu.:0.06700 3rd Qu.: 64.00 3rd Qu.:2075.8
## Max. :2409.0 Max. :0.12800 Max. :138.00 Max. :2692.0
## NA's :62 NA's :1005
## name style
## Nonstop Hef Hop : 12 American IPA : 424
## Dale's Pale Ale : 6 American Pale Ale (APA) : 245
## Oktoberfest : 6 American Amber / Red Ale : 133
## Longboard Island Lager: 4 American Blonde Ale : 108
## 1327 Pod's ESB : 3 American Double / Imperial IPA: 105
## Boston Lager : 3 American Pale Wheat Ale : 97
## (Other) :2376 (Other) :1298
## brewery_id ounces
## Min. : 0.0 Min. : 8.40
## 1st Qu.: 93.0 1st Qu.:12.00
## Median :205.0 Median :12.00
## Mean :231.7 Mean :13.59
## 3rd Qu.:366.0 3rd Qu.:16.00
## Max. :557.0 Max. :32.00
##
As expected, Most American Craft Beers are IPAs, and often are an “American Style” Beer. There are interesting duplicates in the name category, so I’ll need to explore that to make sure there is some reason behind the duplication.
summary(breweries)
## X name city
## Min. : 0.0 Blackrocks Brewery : 2 Portland: 17
## 1st Qu.:139.2 Blue Mountain Brewery : 2 Boulder : 9
## Median :278.5 Lucette Brewing Company: 2 Chicago : 9
## Mean :278.5 Oskar Blues Brewery : 2 Seattle : 9
## 3rd Qu.:417.8 Otter Creek Brewing : 2 Austin : 8
## Max. :557.0 Sly Fox Brewing Company: 2 Denver : 8
## (Other) :546 (Other) :498
## state
## CO : 47
## CA : 39
## MI : 32
## OR : 29
## TX : 28
## PA : 25
## (Other):358
Portland looks to be the top city of craft breweries, but Colorado is the top state. I’m suspicious of Portland’s top spot only because I know there are at least two major cultural cities with the name (ME and OR) - although Portland, OR certainly has a craft brew reputation. There are suspicious duplications in brewery names, so I’ll have to explore that a bit more before visualization as well.
Another note is that while beers has an unnecessary X column that can be dropped, breweries NEEDs the X because it is the primary key. However it should be renamed.
The brewery table needs to strip the whitespace from the state column as well.
beers<-dplyr::select(beers,-X)
breweries<-dplyr::rename(breweries,brewery_id=X)
summary(breweries)
## brewery_id name city
## Min. : 0.0 Blackrocks Brewery : 2 Portland: 17
## 1st Qu.:139.2 Blue Mountain Brewery : 2 Boulder : 9
## Median :278.5 Lucette Brewing Company: 2 Chicago : 9
## Mean :278.5 Oskar Blues Brewery : 2 Seattle : 9
## 3rd Qu.:417.8 Otter Creek Brewing : 2 Austin : 8
## Max. :557.0 Sly Fox Brewing Company: 2 Denver : 8
## (Other) :546 (Other) :498
## state
## CO : 47
## CA : 39
## MI : 32
## OR : 29
## TX : 28
## PA : 25
## (Other):358
beers$brewery_id<-as.factor(beers$brewery_id)
breweries$brewery_id<-as.factor(breweries$brewery_id)
levels(breweries$state) <- trimws(levels(breweries$state))
summary(breweries)
## brewery_id name city state
## 0 : 1 Blackrocks Brewery : 2 Portland: 17 CO : 47
## 1 : 1 Blue Mountain Brewery : 2 Boulder : 9 CA : 39
## 2 : 1 Lucette Brewing Company: 2 Chicago : 9 MI : 32
## 3 : 1 Oskar Blues Brewery : 2 Seattle : 9 OR : 29
## 4 : 1 Otter Creek Brewing : 2 Austin : 8 TX : 28
## 5 : 1 Sly Fox Brewing Company: 2 Denver : 8 PA : 25
## (Other):552 (Other) :546 (Other) :498 (Other):358
beers%>%
filter(name=='Nonstop Hef Hop')
## Warning: package 'bindrcpp' was built under R version 3.4.4
## abv ibu id name style brewery_id
## 1 0.039 20 2400 Nonstop Hef Hop American Pale Wheat Ale 80
## 2 0.039 20 2399 Nonstop Hef Hop American Pale Wheat Ale 80
## 3 0.039 20 2398 Nonstop Hef Hop American Pale Wheat Ale 80
## 4 0.039 20 2397 Nonstop Hef Hop American Pale Wheat Ale 80
## 5 0.039 20 2396 Nonstop Hef Hop American Pale Wheat Ale 80
## 6 0.039 20 2395 Nonstop Hef Hop American Pale Wheat Ale 80
## 7 0.039 20 2394 Nonstop Hef Hop American Pale Wheat Ale 80
## 8 0.039 20 2393 Nonstop Hef Hop American Pale Wheat Ale 80
## 9 0.039 20 2392 Nonstop Hef Hop American Pale Wheat Ale 80
## 10 0.039 20 2391 Nonstop Hef Hop American Pale Wheat Ale 80
## 11 0.039 20 2390 Nonstop Hef Hop American Pale Wheat Ale 80
## 12 0.039 20 2389 Nonstop Hef Hop American Pale Wheat Ale 80
## ounces
## 1 16
## 2 16
## 3 16
## 4 16
## 5 16
## 6 16
## 7 16
## 8 16
## 9 16
## 10 16
## 11 16
## 12 16
There are straight up duplicates. There is no distinguishing difference between the two. Duplicates can be removed.
beers%>%
filter(name=="Dale's Pale Ale")
## abv ibu id name style brewery_id ounces
## 1 0.065 65 1444 Dale's Pale Ale American Pale Ale (APA) 166 12.0
## 2 0.065 65 1252 Dale's Pale Ale American Pale Ale (APA) 166 12.0
## 3 0.065 65 955 Dale's Pale Ale American Pale Ale (APA) 166 19.2
## 4 0.065 65 1 Dale's Pale Ale American Pale Ale (APA) 166 12.0
## 5 0.065 65 1166 Dale's Pale Ale American Pale Ale (APA) 389 19.2
## 6 0.065 65 1065 Dale's Pale Ale American Pale Ale (APA) 389 12.0
This is interesting. There are multiple sizes AND potentially multiple locations for breweries. There are still duplicates, though.
beers%>%
group_by(name)%>%
filter(n()>1)%>%
arrange(name)
## # A tibble: 187 x 7
## # Groups: name [82]
## abv ibu id name style brewery_id ounces
## <dbl> <dbl> <int> <fct> <fct> <fct> <dbl>
## 1 0.051 20 1813 #9 Fruit /… 303 16
## 2 0.051 20 360 #9 Fruit /… 303 12
## 3 0.0560 37 888 1327 Pod's ESB Extra S… 380 12
## 4 0.0560 37 886 1327 Pod's ESB Extra S… 380 12
## 5 0.0560 37 612 1327 Pod's ESB Extra S… 380 12
## 6 0.054 30 2457 312 Urban Pale Ale America… 88 16
## 7 0.054 30 2202 312 Urban Pale Ale America… 88 12
## 8 0.042 18 2201 312 Urban Wheat Ale America… 88 16
## 9 0.042 18 1829 312 Urban Wheat Ale America… 88 12
## 10 0.05 NA 1293 Angry Orchard Crisp Apple Cider 434 16
## # ... with 177 more rows
Complete duplicates should be collapsed, but duplicates on only name need to be handled differently.
summary(distinct(beers,abv,ibu,name,style,brewery_id,ounces,.keep_all=TRUE))
## abv ibu id
## Min. :0.0010 Min. : 4.00 Min. : 4
## 1st Qu.:0.0500 1st Qu.: 21.75 1st Qu.: 814
## Median :0.0570 Median : 35.00 Median :1455
## Mean :0.0599 Mean : 42.79 Mean :1432
## 3rd Qu.:0.0680 3rd Qu.: 64.00 3rd Qu.:2072
## Max. :0.1280 Max. :138.00 Max. :2692
## NA's :62 NA's :1001
## name style
## Oktoberfest : 6 American IPA : 419
## Dale's Pale Ale : 4 American Pale Ale (APA) : 242
## Longboard Island Lager: 3 American Amber / Red Ale : 133
## #9 : 2 American Blonde Ale : 108
## 312 Urban Pale Ale : 2 American Double / Imperial IPA: 103
## 312 Urban Wheat Ale : 2 American Pale Wheat Ale : 85
## (Other) :2358 (Other) :1287
## brewery_id ounces
## 10 : 62 Min. : 8.40
## 25 : 38 1st Qu.:12.00
## 166 : 31 Median :12.00
## 141 : 25 Mean :13.59
## 46 : 24 3rd Qu.:16.00
## 131 : 21 Max. :32.00
## (Other):2176
This does well! Oktoberfest is a generic name, and Dale’s has 4 varietals (potentially).
beers<-distinct(beers,abv,ibu,name,style,brewery_id,ounces,.keep_all=TRUE)
We need to do the same with breweries.
breweries%>%
group_by(name)%>%
filter(n()>1)%>%
arrange(name)
## # A tibble: 14 x 4
## # Groups: name [7]
## brewery_id name city state
## <fct> <fct> <fct> <fct>
## 1 12 Blackrocks Brewery Marquette MI
## 2 95 Blackrocks Brewery Marquette MA
## 3 382 Blue Mountain Brewery Afton VA
## 4 414 Blue Mountain Brewery Arrington VA
## 5 377 Lucette Brewing Company Menominee WI
## 6 456 Lucette Brewing Company Menominie WI
## 7 166 Oskar Blues Brewery Longmont CO
## 8 503 Oskar Blues Brewery Lyons CO
## 9 261 Otter Creek Brewing Waterbury VT
## 10 275 Otter Creek Brewing Middlebury VT
## 11 163 Sly Fox Brewing Company Phoenixville PA
## 12 371 Sly Fox Brewing Company Pottstown PA
## 13 58 Summit Brewing Company St. Paul MN
## 14 138 Summit Brewing Company St Paul MN
Blackrocks is super suspicious, as the city is THE SAME, but the state is different. Checking online, Blackrocks is ONLY in MI.
Blue Mountain is a legitimate split, BM Brewery splits their normal and specialty aged beers between the two locations
Lucette is clearly a misspelling. Hilariously, BOTH are misspellings. Lucette is in Menomonie, WI
Oskar Blues, like BM, actually has two locations (in reality they have several locations and breweries). However I doubt there is a type division
Otter Creek used to be in Waterbury before expanding and moving to Middlebury
Sly fox has two actual locations
Summit has a spelling discrepancy. We want to keep “St. Paul”.
In summary, Only three changes need to be made: to Blackrocks, Lucette, and Summit. Importantly, we need to preserve the joinability of the tables by altering the brewery_id on affected beers.
beers%>%
filter(brewery_id %in% c(12,95))
## abv ibu id name style brewery_id
## 1 0.073 NA 2449 Flying Sailor Rye Beer 95
## 2 0.048 47 2634 Nordskye American IPA 12
## 3 0.060 30 2153 North Third Stout Foreign / Export Stout 12
## 4 0.052 NA 1953 Honey Lav American Pale Wheat Ale 12
## 5 0.068 NA 1496 Coconut Brown Ale American Brown Ale 12
## 6 0.070 51 1481 51K IPA American IPA 12
## 7 0.055 NA 1480 Grand Rabbits Cream Ale 12
## ounces
## 1 12
## 2 12
## 3 12
## 4 12
## 5 12
## 6 12
## 7 12
beers%>%
mutate(brewery_id=replace(brewery_id,brewery_id==95,12))->beers
## Test
beers%>%
filter(brewery_id %in% c(12,95))
## abv ibu id name style brewery_id
## 1 0.073 NA 2449 Flying Sailor Rye Beer 12
## 2 0.048 47 2634 Nordskye American IPA 12
## 3 0.060 30 2153 North Third Stout Foreign / Export Stout 12
## 4 0.052 NA 1953 Honey Lav American Pale Wheat Ale 12
## 5 0.068 NA 1496 Coconut Brown Ale American Brown Ale 12
## 6 0.070 51 1481 51K IPA American IPA 12
## 7 0.055 NA 1480 Grand Rabbits Cream Ale 12
## ounces
## 1 12
## 2 12
## 3 12
## 4 12
## 5 12
## 6 12
## 7 12
beers%>%
filter(brewery_id %in% c(377,456))
## abv ibu id name style brewery_id
## 1 0.052 29 1571 Slow Hand Stout American Stout 377
## 2 0.062 NA 1204 Hips Don't Lie Hefeweizen 456
## 3 0.052 NA 1122 Ride Again Pale Ale American Pale Ale (APA) 456
## 4 0.048 NA 700 The Farmer's Daughter American Blonde Ale 456
## ounces
## 1 16
## 2 16
## 3 16
## 4 16
beers%>%
mutate(brewery_id=replace(brewery_id,brewery_id==377,456))->beers
## Test
beers%>%
filter(brewery_id %in% c(377,456))
## abv ibu id name style brewery_id
## 1 0.052 29 1571 Slow Hand Stout American Stout 456
## 2 0.062 NA 1204 Hips Don't Lie Hefeweizen 456
## 3 0.052 NA 1122 Ride Again Pale Ale American Pale Ale (APA) 456
## 4 0.048 NA 700 The Farmer's Daughter American Blonde Ale 456
## ounces
## 1 16
## 2 16
## 3 16
## 4 16
beers%>%
filter(brewery_id %in% c(58,138))
## abv ibu id name
## 1 0.053 49 2352 Extra Pale Ale
## 2 0.053 40 2549 Make It So
## 3 0.047 55 2473 Hopvale Organic Ale
## 4 0.083 100 2415 Unchained #18 Hop Silo
## style brewery_id ounces
## 1 American Pale Ale (APA) 138 12
## 2 Extra Special / Strong Bitter (ESB) 58 12
## 3 American Pale Ale (APA) 58 16
## 4 American Double / Imperial IPA 58 16
beers%>%
mutate(brewery_id=replace(brewery_id,brewery_id==138,58))->beers
## Test
beers%>%
filter(brewery_id %in% c(58,138))
## abv ibu id name
## 1 0.053 49 2352 Extra Pale Ale
## 2 0.053 40 2549 Make It So
## 3 0.047 55 2473 Hopvale Organic Ale
## 4 0.083 100 2415 Unchained #18 Hop Silo
## style brewery_id ounces
## 1 American Pale Ale (APA) 58 12
## 2 Extra Special / Strong Bitter (ESB) 58 12
## 3 American Pale Ale (APA) 58 16
## 4 American Double / Imperial IPA 58 16
So we will want to remove the breweries with the following IDs: 95, 377, and 138. Additionally, We need to modify Lucette to have to correct spelling. Each has only one outlier
breweries%>%
filter(!brewery_id %in% c(95,377,138))->breweries
## Test
breweries%>%
group_by(name)%>%
filter(n()>1)%>%
arrange(name)
## # A tibble: 8 x 4
## # Groups: name [4]
## brewery_id name city state
## <fct> <fct> <fct> <fct>
## 1 382 Blue Mountain Brewery Afton VA
## 2 414 Blue Mountain Brewery Arrington VA
## 3 166 Oskar Blues Brewery Longmont CO
## 4 503 Oskar Blues Brewery Lyons CO
## 5 261 Otter Creek Brewing Waterbury VT
## 6 275 Otter Creek Brewing Middlebury VT
## 7 163 Sly Fox Brewing Company Phoenixville PA
## 8 371 Sly Fox Brewing Company Pottstown PA
Now just to fix Menomonie
breweries%>%
filter(state=='WI')
## brewery_id name city state
## 1 32 James Page Brewing Company Stevens Point WI
## 2 73 Ale Asylum Madison WI
## 3 120 Horny Goat Brew Pub Milwaukee WI
## 4 131 Stevens Point Brewery Stevens Point WI
## 5 134 Minhas Craft Brewery Monroe WI
## 6 146 Angry Minnow Hayward WI
## 7 192 Capital Brewery Middleton WI
## 8 211 Wisconsin Brewing Company Verona WI
## 9 228 Geneva Lake Brewing Company Lake Geneva WI
## 10 284 Milwaukee Brewing Company Milwaukee WI
## 11 349 Hinterland Brewery Green Bay WI
## 12 406 Lazy Monk Brewing Eau Claire WI
## 13 425 Ciderboys Stevens Point WI
## 14 456 Lucette Brewing Company Menominie WI
## 15 486 Stillmank Beer Company Green Bay WI
## 16 498 Sprecher Brewing Company Glendale WI
## 17 529 Northwoods Brewpub Eau Claire WI
## 18 542 Angry Minnow Brewing Company Hayward WI
## 19 554 Dave's Brewfarm Wilson WI
breweries%>%
filter(city=="Menominie")
## brewery_id name city state
## 1 456 Lucette Brewing Company Menominie WI
breweries%>%
mutate(city = dplyr::recode(city, "Menominie" = "Menomonie"))->breweries
## Test
breweries%>%
filter(brewery_id==456)
## brewery_id name city state
## 1 456 Lucette Brewing Company Menomonie WI
Lets take one more look at the data summaries before we break into visualization
summary(beers)
## abv ibu id
## Min. :0.0010 Min. : 4.00 Min. : 4
## 1st Qu.:0.0500 1st Qu.: 21.75 1st Qu.: 814
## Median :0.0570 Median : 35.00 Median :1455
## Mean :0.0599 Mean : 42.79 Mean :1432
## 3rd Qu.:0.0680 3rd Qu.: 64.00 3rd Qu.:2072
## Max. :0.1280 Max. :138.00 Max. :2692
## NA's :62 NA's :1001
## name style
## Oktoberfest : 6 American IPA : 419
## Dale's Pale Ale : 4 American Pale Ale (APA) : 242
## Longboard Island Lager: 3 American Amber / Red Ale : 133
## #9 : 2 American Blonde Ale : 108
## 312 Urban Pale Ale : 2 American Double / Imperial IPA: 103
## 312 Urban Wheat Ale : 2 American Pale Wheat Ale : 85
## (Other) :2358 (Other) :1287
## brewery_id ounces
## 10 : 62 Min. : 8.40
## 25 : 38 1st Qu.:12.00
## 166 : 31 Median :12.00
## 141 : 25 Mean :13.59
## 46 : 24 3rd Qu.:16.00
## 131 : 21 Max. :32.00
## (Other):2176
str(beers)
## 'data.frame': 2377 obs. of 7 variables:
## $ abv : num 0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
## $ ibu : num NA NA NA NA NA NA NA NA NA NA ...
## $ id : int 1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
## $ name : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1638 577 1704 1842 1819 268 1160 758 1093 486 ...
## $ style : Factor w/ 100 levels "","Abbey Single Ale",..: 19 18 16 12 16 80 18 22 18 12 ...
## $ brewery_id: Factor w/ 558 levels "0","1","2","3",..: 409 178 178 178 178 178 178 178 178 178 ...
## $ ounces : num 12 12 12 12 12 12 12 12 12 12 ...
summary(breweries)
## brewery_id name city
## 0 : 1 Blue Mountain Brewery : 2 Portland: 17
## 1 : 1 Oskar Blues Brewery : 2 Boulder : 9
## 2 : 1 Otter Creek Brewing : 2 Chicago : 9
## 3 : 1 Sly Fox Brewing Company : 2 Seattle : 9
## 4 : 1 10 Barrel Brewing Company: 1 Austin : 8
## 5 : 1 18th Street Brewery : 1 Denver : 8
## (Other):549 (Other) :545 (Other) :495
## state
## CO : 47
## CA : 39
## MI : 32
## OR : 29
## TX : 28
## PA : 25
## (Other):355
str(breweries)
## 'data.frame': 555 obs. of 4 variables:
## $ brewery_id: Factor w/ 558 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ name : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 12 266 319 201 136 227 477 59 491 ...
## $ city : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 200 122 299 300 62 91 48 152 136 ...
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 24 18 20 5 5 41 6 23 23 23 ...
abvcolor<-aes(color=I('gray'),fill=I('tomato'))
ggplot(data=beers,aes(x=abv))+
geom_histogram(binwidth=.002,abvcolor)
## Warning: Removed 62 rows containing non-finite values (stat_bin).
Many beers have ABV of .05. 5% is pretty standard beer alcohol content. It skews right - and I think this is because craft beers generally try to be unique and have something that makes it different than a domestic. Higher alcohol content is a simple way to justify the craft value.
ibucolor<-aes(color=I('gray'),fill=I('limegreen'))
ggplot(data=beers,aes(x=ibu))+
geom_histogram(ibucolor,binwidth = 5)
## Warning: Removed 1001 rows containing non-finite values (stat_bin).
IBUs measure the bitterness. There was (maybe still is) a IPA craze in craft beer. It is a weird machismo thing to max out bitterness at the expense of balance. It is a common way to distinguish yourself - check the peak at 100. The otherwise bimodality is likely due to most beer types have IBUs under 25, but IPAs and other high-hopped beers seek for near 75.
ozcolor<-aes(color=I('gray'),fill=I('cadetblue'))
ggplot(data=beers,aes(x=ounces))+
geom_bar(ozcolor)+
scale_x_continuous(breaks=seq(0,32,4))
There are very common beer can sizes. 12 oz, 16 oz etc. There are some oddballs.
ggplot(data=beers,aes(x=reorder(style,style,function(x)+length(x))))+
geom_bar()+
scale_y_log10()+
coord_flip()
Note the log scale.American style beers dominate the american craft brew scene. It is noteworthy that lagers are technically more difficult and expensive to make, and a lot of craft breweries brew ales. It is also noteworthy that there are around 10 beers without a style attached to them.
joineddata<-dplyr::left_join(beers,breweries, by="brewery_id",suffix=c(".beer",".brewer"))
summary(joineddata)
## abv ibu id
## Min. :0.0010 Min. : 4.00 Min. : 4
## 1st Qu.:0.0500 1st Qu.: 21.75 1st Qu.: 814
## Median :0.0570 Median : 35.00 Median :1455
## Mean :0.0599 Mean : 42.79 Mean :1432
## 3rd Qu.:0.0680 3rd Qu.: 64.00 3rd Qu.:2072
## Max. :0.1280 Max. :138.00 Max. :2692
## NA's :62 NA's :1001
## name.beer style
## Oktoberfest : 6 American IPA : 419
## Dale's Pale Ale : 4 American Pale Ale (APA) : 242
## Longboard Island Lager: 3 American Amber / Red Ale : 133
## #9 : 2 American Blonde Ale : 108
## 312 Urban Pale Ale : 2 American Double / Imperial IPA: 103
## 312 Urban Wheat Ale : 2 American Pale Wheat Ale : 85
## (Other) :2358 (Other) :1287
## brewery_id ounces name.brewer
## 10 : 62 Min. : 8.40 Brewery Vivant : 62
## 25 : 38 1st Qu.:12.00 Oskar Blues Brewery : 44
## 166 : 31 Median :12.00 Sun King Brewing Company : 38
## 141 : 25 Mean :13.59 Cigar City Brewing Company: 25
## 46 : 24 3rd Qu.:16.00 Sixpoint Craft Ales : 24
## 131 : 21 Max. :32.00 Stevens Point Brewery : 21
## (Other):2176 (Other) :2163
## city state
## Grand Rapids: 66 CO : 261
## Chicago : 55 CA : 181
## Portland : 53 MI : 163
## Indianapolis: 43 IN : 139
## San Diego : 42 TX : 130
## Boulder : 41 OR : 114
## (Other) :2077 (Other):1389
top20brew<-joineddata %>%
count(brewery_id) %>%
top_n(20)
ggplot(data=subset(joineddata,brewery_id %in% top20brew$brewery_id),
aes(x=reorder(name.brewer,name.brewer, function(x)+length(x))))+
geom_bar()+
coord_flip()
Some breweries are going crazy pumping out different beers.
I don’t think looking at name will yield much, as the names are largely unique.
Now lets look at the breweries table closer.
statecolor=aes(color=I('gray'),fill=I('mediumpurple'))
ggplot(data=breweries, aes(x=reorder(state,state, function(x)+length(x))))+
geom_bar(statecolor)+
coord_flip()
Colorado is king! Colorado has a huge brewing history, as well as Michigan. California is just huge. Oregon has a reputation for craft breweries, especially in Portland.
Speaking of Portland, there are two important Portlands, so I want to check if duplicate city names in different states is a factor. likely just need to create a new variable that combines the two.
breweries %>%
mutate(citystate=interaction(city,state,sep=", "))->breweries
summary(breweries)
## brewery_id name city
## 0 : 1 Blue Mountain Brewery : 2 Portland: 17
## 1 : 1 Oskar Blues Brewery : 2 Boulder : 9
## 2 : 1 Otter Creek Brewing : 2 Chicago : 9
## 3 : 1 Sly Fox Brewing Company : 2 Seattle : 9
## 4 : 1 10 Barrel Brewing Company: 1 Austin : 8
## 5 : 1 18th Street Brewery : 1 Denver : 8
## (Other):549 (Other) :545 (Other) :495
## state citystate
## CO : 47 Portland, OR : 11
## CA : 39 Boulder, CO : 9
## MI : 32 Chicago, IL : 9
## OR : 29 Seattle, WA : 9
## TX : 28 San Diego, CA: 8
## PA : 25 Denver, CO : 8
## (Other):355 (Other) :501
We can see that the number of “Portland” cities is not equal to the number of “Portland, OR”, meaning that Portland, ME has a few craft breweries.
top10citystate<-breweries %>%
count(citystate) %>%
top_n(10)
ggplot(data=subset(breweries,citystate %in% top10citystate$citystate),
aes(x=reorder(citystate,citystate, function(x)+length(x))))+
geom_bar(statecolor)+
coord_flip()
Both Portlands are in the top 10.
standard relational dataset.
Things like ABV and IBU are important, as well as location.
style will be interesting to correlate with ABV and IBU.
I created a citystate variable to account for same city names in different states
There is a slight tail on abv, which makes sense. Some beers are high alcohol content by design. There is a bimodality (or tri-modality) in IBUs, which is likely tied to unique beer styles that are know for bitterness, like IPAs.
I should’ve just done this in the beginning, but I’ll work exclusively on the joined data now.
joineddata<-dplyr::left_join(beers,breweries, by="brewery_id",suffix=c(".beer",".brewer"))
summary(joineddata)
## abv ibu id
## Min. :0.0010 Min. : 4.00 Min. : 4
## 1st Qu.:0.0500 1st Qu.: 21.75 1st Qu.: 814
## Median :0.0570 Median : 35.00 Median :1455
## Mean :0.0599 Mean : 42.79 Mean :1432
## 3rd Qu.:0.0680 3rd Qu.: 64.00 3rd Qu.:2072
## Max. :0.1280 Max. :138.00 Max. :2692
## NA's :62 NA's :1001
## name.beer style
## Oktoberfest : 6 American IPA : 419
## Dale's Pale Ale : 4 American Pale Ale (APA) : 242
## Longboard Island Lager: 3 American Amber / Red Ale : 133
## #9 : 2 American Blonde Ale : 108
## 312 Urban Pale Ale : 2 American Double / Imperial IPA: 103
## 312 Urban Wheat Ale : 2 American Pale Wheat Ale : 85
## (Other) :2358 (Other) :1287
## brewery_id ounces name.brewer
## 10 : 62 Min. : 8.40 Brewery Vivant : 62
## 25 : 38 1st Qu.:12.00 Oskar Blues Brewery : 44
## 166 : 31 Median :12.00 Sun King Brewing Company : 38
## 141 : 25 Mean :13.59 Cigar City Brewing Company: 25
## 46 : 24 3rd Qu.:16.00 Sixpoint Craft Ales : 24
## 131 : 21 Max. :32.00 Stevens Point Brewery : 21
## (Other):2176 (Other) :2163
## city state citystate
## Grand Rapids: 66 CO : 261 Grand Rapids, MI: 66
## Chicago : 55 CA : 181 Chicago, IL : 55
## Portland : 53 MI : 163 Indianapolis, IN: 43
## Indianapolis: 43 IN : 139 San Diego, CA : 42
## San Diego : 42 TX : 130 Boulder, CO : 41
## Boulder : 41 OR : 114 Portland, OR : 41
## (Other) :2077 (Other):1389 (Other) :2089
Start with a ggpairs, as it plots all the bivariates at once.
library(GGally)
library(scales)
library(memisc)
ggpairs(dplyr::select(joineddata,
-ends_with("id"),-starts_with("name."),-starts_with("city"),-style,-state),
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
Unfortunately, most of the data has a huge number of factors and isn’t ameniable to pairplotting. But this is still useful.
ggplot(data=joineddata,aes(x=abv,y=ibu))+
geom_point(alpha=.5)+
geom_smooth(method="lm",formula=y~x)+
geom_density2d()
## Warning: Removed 1001 rows containing non-finite values (stat_smooth).
## Warning: Removed 1001 rows containing non-finite values (stat_density2d).
## Warning: Removed 1001 rows containing missing values (geom_point).
with(joineddata,cor.test(abv,ibu))
##
## Pearson's product-moment correlation
##
## data: abv and ibu
## t = 33.284, df = 1374, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6377807 0.6963675
## sample estimates:
## cor
## 0.6681083
There is a correlation between ABV and IBU. There is no technical reason for these to be correlated, as alcohol content and ibu are adjusted with different components of beer. However, it is a competitive advantage in niche craft markets to be exceptional. the hoppiest and most alcoholic beer is a branding edge that a small craft brewery would strive to have.
ggplot(data=joineddata,aes(x=reorder(state,-abv,median,na.rm = TRUE), y=abv))+
geom_jitter(alpha=.3,color='tomato')+
geom_boxplot(alpha=.5,fill='tomato')+
stat_summary(fun.y="mean",geom="point",color="green",shape=8,size=2)+
coord_flip()
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing non-finite values (stat_summary).
## Warning: Removed 62 rows containing missing values (geom_point).
Utah has a oddball beer law, saying beer must be 4% ABV max, whihc explains their location! Many of the states have pretty wide spreads.
ggplot(data=joineddata,aes(x=reorder(state,-ibu,median,na.rm = TRUE), y=ibu))+
geom_jitter(alpha=.3,color='limegreen')+
geom_boxplot(alpha=.5,fill='limegreen')+
stat_summary(fun.y="mean",geom="point",color="red",shape=8,size=2)+
coord_flip()
## Warning: Removed 1001 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1001 rows containing non-finite values (stat_summary).
## Warning: Removed 1001 rows containing missing values (geom_point).
There is an interesting narrative of having mostly midwestern states near the bottom of IBUs, and having southern and costal states near the top. it isnt a perfect narrative.
This needs to be loaded up into tableau for some interesting maps.
Another important thing to do is to actually make a region variable! If we categorize the states into the four regions: Northeast, South, Midwest, and West; and into divisions: New England, Mid-atlantic, East North Central, West North Central, South atlantic, East South Central, West South Central, Mountain, and Pacific; we may see cultural or regional differences with better clarity.
state<- unique(joineddata$state)
divisions<-data.frame(state)
divisions<-divisions%>%
mutate(division=case_when(state %in% c("CT","ME","MA","NH","RI","VT") ~ "New England",
state %in% c("NJ","NY","PA") ~ "Mid-Atlantic",
state %in% c("WV","MD","DE","DC","VA","NC","SC","GA","FL") ~ "South Atlantic",
state %in% c("KY","TN","MS","AL") ~ "East South Central",
state %in% c("MI","OH","IN","IL","WI") ~ "East North Central",
state %in% c("MN","IA","MO","ND","SD","NE","KS") ~ "West North Central",
state %in% c("OK","AR","LA","TX") ~ "West South Central",
state %in% c("CO","NM","WY","MT","ID","UT","AZ","NV") ~ "Mountain",
state %in% c("WA","OR","CA","AK","HI") ~ "Pacific",
TRUE ~ "OTHER"))
divisions<-divisions%>%
mutate(region=case_when(division %in% c("New England","Mid-Atlantic") ~ "Northeast",
division %in% c("South Atlantic","East South Central","West South Central") ~"South",
division %in% c("East North Central","West North Central") ~"Midwest",
division %in% c("Mountain","Pacific") ~ "West",
TRUE ~ "OTHER"))
Now we can reupdate the joineddata by joining our new table!
joineddata<-joineddata %>%
left_join(divisions,by="state")
Lets look at divisions and regions for ABV and IBU
ggplot(data=joineddata,aes(x=reorder(division,-abv,median,na.rm = TRUE), y=abv))+
geom_jitter(alpha=.3,color='tomato')+
geom_boxplot(alpha=.5,fill='tomato')+
stat_summary(fun.y="mean",geom="point",color="green",shape=8,size=2)+
coord_flip()
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing non-finite values (stat_summary).
## Warning: Removed 62 rows containing missing values (geom_point).
Every division seems to keep it pretty close to 6%, but it looks like the southwestern states like slightly lower abv than the great lakes. Maybe more ice beers in the colder climate?
ggplot(data=joineddata,aes(x=reorder(division,-ibu,median,na.rm = TRUE), y=ibu))+
geom_jitter(alpha=.3,color='limegreen')+
geom_boxplot(alpha=.5,fill='limegreen')+
stat_summary(fun.y="mean",geom="point",color="red",shape=8,size=2)+
coord_flip()
## Warning: Removed 1001 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1001 rows containing non-finite values (stat_summary).
## Warning: Removed 1001 rows containing missing values (geom_point).
This is neat, the pacific coast states have a huge range in IBUs, while the heartland states maybe don’t like the bitterness for the sake of bitterness.
ggplot(data=joineddata,aes(x=reorder(region,-abv,median,na.rm = TRUE), y=abv))+
geom_jitter(alpha=.3,color='tomato')+
geom_boxplot(alpha=.5,fill='tomato')+
stat_summary(fun.y="mean",geom="point",color="green",shape=8,size=2)+
coord_flip()
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing non-finite values (stat_summary).
## Warning: Removed 62 rows containing missing values (geom_point).
the differences get even smaller. The west coast seems to be more experimental than the other areas. Likely just a factor of number of breweries/beers.
ggplot(data=joineddata,aes(x=reorder(region,-ibu,median,na.rm = TRUE), y=ibu))+
geom_jitter(alpha=.3,color='limegreen')+
geom_boxplot(alpha=.5,fill='limegreen')+
stat_summary(fun.y="mean",geom="point",color="red",shape=8,size=2)+
coord_flip()
## Warning: Removed 1001 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1001 rows containing non-finite values (stat_summary).
## Warning: Removed 1001 rows containing missing values (geom_point).
Looks like the same trend holds, the midwest likes the more balanced bitterness (influenced by budweiser in STL maybe?) while the west coast likes to experiment more.
ggplot(data=joineddata,aes(x=abv,y=ibu,color=region))+
geom_point(alpha=.5)+
geom_smooth(method="lm",formula=y~x)
## Warning: Removed 1001 rows containing non-finite values (stat_smooth).
## Warning: Removed 1001 rows containing missing values (geom_point).
Basically no distinction within these regions, except that the south and the midwest have a few particularly high ABV choices.
ggplot(data=joineddata,aes(x=abv,y=ibu,color=division))+
geom_point(alpha=.5)
## Warning: Removed 1001 rows containing missing values (geom_point).
optional part, use a model to try to fit data.
#
# m1 <- polr(quality ~ variable1, data = edadata, Hess=TRUE)
# m2 <- update(m1, ~ . + variable2)
# m3 <- update(m2, ~ . + variable3)
#
# mtable(m1, m2, m3)
# copied from above
Highlight the most impactful and interesting plots from above, giving a standalone discription of the plot and what it shows.
Reflect on data quality - the difficulties and deficits of the data - and potential options to expand data collection to remedy deficits.
Reflect on data structure that appeals to common beliefs about the data, structure that challenges preconceived notions, and most importantly any surprising and un-thought-of structure that stands out.
Note on next steps (collect more data, further exploration, machine learning)